Computer-generated image processing including volumetric scene reconstruction to replace a designated region

ABSTRACT

An imagery processing system determines pixel color values for pixels of captured imagery from volumetric data, providing alternative pixel color values. A main imagery capture device, such as a camera, captures main imagery, such as still images and/or video sequences, of a live action scene. Alternative devices capture imagery of the live action scene, in some spectra and form, and capture information related to pixel color values for multiple depths of a scene, which can be processed to provide reconstruction of an image including replacing a designated region in the image.

CROSS REFERENCES TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 17/018,943, entitled COMPUTER-GENERATED IMAGE PROCESSING INCLUDING VOLUMETRIC SCENE RECONSTRUCTION, filed on Sep. 11, 2020 (WD0008US1), which claims the benefit of U.S. Provisional Patent Application Ser. No. 62/983,530, entitled COMPUTER-GENERATED IMAGE PROCESSING INCLUDING VOLUMETRIC SCENE RECONSTRUCTION, filed on Feb. 28, 2020, which is hereby incorporated by reference as if set forth in full in this application for all purposes.

This application is related to the following applications, which are hereby incorporated by reference as if set forth in full in this application for all purposes:

U.S. patent application Ser. No. 17/018,960, entitled IMAGE PROCESSING FOR REDUCING ARTIFACTS CAUSED BY REMOVAL OF SCENE ELEMENTS FROM IMAGES (WD0005US1), filed on Sep. 11, 2020;

U.S. patent application Ser. No. 17/018,933, entitled RECONSTRUCTION OF OBSCURED VIEWS OF CAPTURED IMAGERY USING ARBITRARY CAPTURED INPUTS (WD0006US1), filed on Sep. 11, 2020; and

U.S. patent application Ser. No. 17/018,948, entitled RECONSTRUCTION OF OBSCURED VIEWS IN CAPTURED IMAGERY USING PIXEL REPLACEMENT FROM SECONDARY IMAGERY (WD0009US1), filed on Sep. 11, 2020.

FIELD OF THE INVENTION

The present disclosure generally relates to digital image manipulation. The disclosure relates more particularly to apparatus and techniques for reconstructing frames of video or still images captured of a scene using volumetric data captured of the scene.

BACKGROUND

In modern digital imagery creation (still images, video sequences of frames of images), there is often a desire to change what is captured by a camera to convey something different. This might be the case where a camera captures a scene in which two actors are acting and a content creator later determines that one of the actors is to be removed from the captured video, so that the resulting video sequence shows what was behind the removed actor, so that a computer-generated character or object takes the place of the removed actor, or for other reasons.

Viewer expectations are that artifacts of the removal from a captured video sequence not be readily apparent. Simply removing the pixels corresponding to the removed character would leave a blank spot in the video. Simply replacing those pixels with a generic background would leave artifacts at the boundary between pixels that were part of the removed character and nearby pixels. With sufficient time, effort, and computing power, an artist might manually "paint" the pixels in each frame of the video where the removed character was, but doing so well enough that viewers do not perceive an artifact of the removal can be time consuming and tedious.

Tools for more simply performing manipulation of imagery data would be useful.

SUMMARY

An imagery processing system determines pixel color values for pixels of captured imagery from volumetric data, providing alternative pixel color values. A main imagery capture device, such as a camera, captures main imagery, such as still images and/or video sequences, of a live action scene. Alternative devices capture imagery of the live action scene, in some spectra and form, and capture information related to pixel color values for multiple depths of a scene, which can be processed to provide reconstruction of an image of the scene whereby a designated region can be replaced so as to, for example, remove an actor or object in the scene.

By replacing pixels in the main imagery by selecting among volumetrically generated alternatives, plates or views can be reconstructed to include portions of objects in the live action scene that were obscured in what was captured as part of the main imagery.

The following detailed description together with the accompanying drawings will provide a better understanding of the nature and advantages of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:

FIG. 1 illustrates an environment in which imagery and data about a scene might be captured, from a top view, according to various embodiments.

FIG. 2 illustrates a stage, from a top view, in which a scene is captured and has several possible plates of the scene that might be used in generating reconstructed imagery of what would be visible, according to various embodiments.

FIG. 3 is a side view of a scene that might include occlusions to be reconstructed, according to various embodiments.

FIG. 4 is a block diagram of a system for creating reconstructed imagery from captured imagery of a scene and arbitrary inputs captured from the scene, according to various embodiments.

FIG. 5 is a flowchart of a process for processing main imagery inputs and alternative inputs to determine replacement sets of pixel color values derived from those alternative inputs.

FIG. 6 illustrates an example of a visual content generation system as might be used to generate imagery in the form of still images and/or video sequences of images.

FIG. 7 is a block diagram illustrating an example computer system upon which computer systems of the systems illustrated in FIGS. 1 through 6 may be implemented.

DETAILED DESCRIPTION

In the following description, various embodiments will be described. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the embodiments. However, it will also be apparent to one skilled in the art that the embodiments may be practiced without the specific details. Furthermore, well-known features may be omitted or simplified in order not to obscure the embodiment being described.

Techniques described and suggested herein include generating modified video from captured video of a scene and additional inputs related to the scene, where the modified video is digitally modified to replace all or portions of objects in the scene recorded in the captured video (i.e., "original video"). It should be understood that examples described with reference to video sequences can apply to single or still images, unless otherwise indicated. A scene might comprise various objects and actors appearing in the scene, possibly moving, possibly being subject to lighting changes and/or camera movements. Herein, where an object is described as including an object that is visible in the scene or not visible in the scene, the teaching might also apply to human and/or non-human actors. Thus, a step in a process that captures an image of a scene and then processes the digitally captured video to remove an actor from the scene and reconstruct what was supposed to be behind that actor might also be used for removing inanimate or non-actor objects from the scene.

Rationales for modifying a video post-capture can vary, and many of the techniques described herein work well regardless of the rationale. One rationale is that a scene is to be captured with three actors interacting where one of the actors is outfitted with motion capture ("mo-cap") fiducials (contrasting markers, paint, etc.) and the modified video will have a computer-generated character moving in the scene in place of the mo-cap actor, such as where the computer-generated character is a non-human character. Another rationale might be that a video of a scene is captured and, in post-production, a director changes a plot and that change requires that some character or object not be present even though it is present in the original captured imagery. Yet another rationale is the discovery of filming errors that need to be corrected and a scene cannot be easily reshot.

In an embodiment, in addition to capturing a main camera view (e.g., a hero camera view), other on-set devices capture information about the scene that could be used to determine what might be obscured by objects on the set. Specifically, the hero camera view for an original frame can be considered the result of hero rays passing through pixels in a view frame until they intersect with an object in the scene, and the corresponding pixel color value used for that pixel of that frame is the color value of the intersected object. With additional devices, color values can be obtained and stored of objects that would have been intersected in the absence of the object that was intersected. If this additional color value data is stored such that it can be referenced by hero ray and depth, a reconstructed (i.e., "modified") frame might be generated by considering the hero frame and altering the pixel color values to use pixel color values at greater depths along the hero ray. In other words, the color of a pixel is taken from volumetric data to be the color at a different depth along that pixel's ray than what was originally captured by the main camera.
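The depth-indexed lookup described above can be illustrated with a short, hypothetical sketch. Names such as RaySample and resolve_pixel are invented for illustration and are not from the application; the sketch assumes the volumetric data has already been organized as a per-pixel list of (depth, color) samples along each hero ray.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RaySample:
    depth: float                 # distance along the hero ray, in scene units
    color: Tuple[int, int, int]  # RGB color observed at that depth


def resolve_pixel(samples: List[RaySample],
                  min_depth: float = 0.0) -> Optional[Tuple[int, int, int]]:
    """Return the color of the nearest sample at or beyond min_depth.

    With min_depth = 0 this reproduces the main-camera result (nearest hit);
    raising min_depth past a removed object's depth yields the color that
    would have been seen had that object not been present.
    """
    candidates = [s for s in samples if s.depth >= min_depth]
    if not candidates:
        return None  # nothing known beyond the removed object for this ray
    return min(candidates, key=lambda s: s.depth).color


# Example: a hero ray that first hits an actor at depth 2.1 and a wall at 9.8.
ray = [RaySample(2.1, (200, 150, 120)), RaySample(9.8, (40, 60, 80))]
print(resolve_pixel(ray))                 # (200, 150, 120) -- original capture
print(resolve_pixel(ray, min_depth=3.0))  # (40, 60, 80)    -- actor removed
```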

In a more general case, one or more 2D or 3D images are recreated, which could be from a reconstructed plate or a set of normalized, aligned planes from which an image can be reconstructed with objects filled in or removed. A recreation could use a flattened image, a volumetric file representation, or a multiple-depth file representation wherein pixels of an image are represented by more than one pixel color value at different depths.

The volumetric data might be generated using data captured by a plurality of devices that are positioned about the scene having different vantage points. Such devices might include witness cameras, Lidar capture devices, cameras/devices looking down from above a scene, cameras/devices looking up from below a scene, and the like. The volumetric data could also include data about atmospheric effects, such as fog, that would cause some atmospheric occlusion, perhaps as a function of depth.

As part of the reconstruction, geometric processing might be done to reflect the fact that the other devices might be placed at quite different angles from the hero camera. With the other devices viewing a scene from quite different angles, otherwise obscured objects can be recorded and depth for those other objects can be obtained.

In a specific example, a scene might be shot with a hero camera capturing a main action, while a witness camera is offset from the main action by 90 degrees relative to the hero camera, such as where the hero camera is pointed horizontally and the witness camera is directly above the main action and pointed downward. The hero camera could be a stereo camera, which could pick up some depth information, but might miss information about elements obscured by closer objects in the scene. In effect, the witness camera can pick up multiple points along each hero ray, including determining where along the hero ray an atmospheric effect begins and ends. This would be useful if, during reconstruction, a near object is to be removed from a scene so that a further object is then visible but fog needs to be added to account for fog in the scene that was between the near object and the further object.

If multiple ancillary devices are used, additional depth information might be obtained and used as part of the volumetric data. For example, the witness cameras can be stereo cameras. If the volumetric data is stored in a format that indexes by depth, a reconstructed scene can be quickly generated from that volumetric data and the main camera video. Depth information can be used to allow a mapping of visual data onto three-dimensional models or structures that are captured or otherwise defined within the scene.

FIG. 1 illustrates, from a top view, an environment in which imagery and data about a scene might be captured and used to capture and process volumetric data about the scene, according to various embodiments. FIG. 1 is an approximate top view of a stage 102 on which actors 104 and 106 and other objects 108, 110, and 112 are present. Action and the scene might be captured by a camera 120, which might be movable on a track 122. A background wall 124 might provide content of the scene that is captured by camera 120, and a green screen 126 might also be present and visible in the scene. As is known, green screens can be added to scenes to facilitate the insertion of content into a frame where that content does not exist in the scene but is added post-capture of the scene. Camera 120 might be a main camera, a hero camera, that is expected to capture the bulk of the scene. In some variations, multiple hero cameras are used to allow for cutting quickly from one view of the scene to another.

In the digital video captured by camera 120 (or later-digitized video derived from analog filming of the scene), for the indicated position of camera 120 on track 122, actor 106 would be partially obscured in the video by actor 104 and object 110, while background wall 124 is partially obscured by object 112. To provide a director an option to cast the scene without actor 104 or object 112, the director could request that the entire scene be shot a second time without actor 104 and object 112, but often such decisions are not made until after the scene is shot and the actors, objects, or environment may no longer be available. Artists could manually paint frames to remove an object, but that can be time consuming to get right.

To provide information for an automated plate reconstruction, additional devices might be deployed on or about stage 102 to gather data that can be used for reconstruction. For example, witness cameras 130, 132 might be deployed to capture black and white, high resolution, low resolution, infrared, or other particular wavelengths and resolutions of what is happening in the scene. A Lidar device 140 might also be deployed to capture point clouds of distances to objects. In some embodiments, a plate can have depth and can define a volume instead of, or in addition to, one or more planar surfaces. In general, operations and properties described herein for two-dimensional images may be applicable to three-dimensional volumes. For example, capturing, manipulating, rendering, or otherwise processing two-dimensional items, such as images, frames, pixels, etc., can apply to three-dimensional items such as models, settings, voxels, etc., unless otherwise indicated.

It may be that a director or artist desires to use computerized imagery editing tools to edit captured video from camera 120 such that the plate of interest is plate 106. In that case, editing might involve not only removing pixels from frames that correspond to actor 104, but also filling in pixel color values for those pixels with what would have been captured by camera 120 for those pixels but for the obscuring effects of the opacity of actor 104 and object 110.

FIG. 2 illustrates a stage 202, from a top view, in which a scene is captured and has several possible plates 204(1)-(4) of the scene that might be used in generating reconstructed imagery of what would be visible and that uses various cameras. As illustrated, cameras 206(1)-(3) might be identically configured cameras, while camera 208 is configured differently. Such an arrangement, unless existing for other reasons, might make reconstruction impractical, whereas an arrangement of FIG. 1 might not add complexity if the various different capture devices are already in place for other reasons. In FIG. 2, camera 208 might be placed and optimized for motion capture of action on the stage, such as where one or more of objects 212(1)-(5) present on stage 202 is outfitted for motion capture. It can be efficient if inputs from camera 208 could be used for plate reconstruction, but quite often the information gathered, sensitivity, position, lighting, etc., are uncoordinated with those elements of cameras 206(1)-206(3).

FIG. 3 is a side view of a scene that might include occlusions to be reconstructed. In a captured scene 302, a person 304 is between a house 306 and a camera that captured the image. A plate reconstruction process might be used to generate, from a video sequence that includes person 304 walking in front of house 306, a reconstructed video of a plate that is behind person 304 so that, for example, the reconstructed video would display a window 308 on house 306 unobscured by person 304, despite that the main camera did not capture all of the pixels that would make up a view of window 308.

FIG. 4 is a block diagram of a system 400 for creating reconstructed imagery from captured imagery of a scene and arbitrary inputs captured from the scene. An advantage of allowing for arbitrary types of input is that preexisting devices, or devices added for other purposes, can be used for reconstruction. In part, system 400 can be used for reconstructing imagery for captured scenes when editing is done to remove objects from the scene that were present when captured. As illustrated, main camera video 402 is stored into main scene capture storage 404. Arbitrary inputs 406 can be obtained from other capture devices (mo-cap cameras, contrast cameras, stereo capture devices, Lidar, light sensors, environmental sensors, etc.). A preprocessor 410 obtains reference inputs 412, reference stage parameters 414, and capture device positions/settings 416 and processes those to generate normalizing parameters that can be stored in normalizing parameter storage 420.

In some cases, preprocessing and normalization are not needed. For example, where a similar camera is used for a main camera and side cameras, the pixel color values might not need to be normalized. They might, however, need to be translated or transformed, linearly or nonlinearly, to account for different viewing angles, distances, etc., so that corresponding pixel sets can be identified.
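As a hedged illustration of the kind of linear transform mentioned above, the following sketch warps a side-camera image into the main camera's pixel grid using a planar homography. The matrix values are placeholders standing in for a calibration result, and a purely planar warp is only an approximation for scene content off the reference plane.

```python
import numpy as np


def warp_homography(src: np.ndarray, H: np.ndarray, out_shape) -> np.ndarray:
    """Warp src into the main camera's image plane using homography H
    (maps main-camera pixel coordinates to side-camera pixel coordinates),
    sampling with nearest-neighbor lookup."""
    h_out, w_out = out_shape
    ys, xs = np.mgrid[0:h_out, 0:w_out]
    ones = np.ones_like(xs)
    pts = np.stack([xs, ys, ones], axis=-1).reshape(-1, 3).T  # 3 x N homogeneous
    mapped = H @ pts
    mapped = mapped[:2] / mapped[2]                           # back to pixel coords
    u = np.clip(np.round(mapped[0]).astype(int), 0, src.shape[1] - 1)
    v = np.clip(np.round(mapped[1]).astype(int), 0, src.shape[0] - 1)
    return src[v, u].reshape(h_out, w_out, -1)


# Placeholder homography (identity plus a small shear), a stand-in for calibration.
H = np.array([[1.0, 0.05, 0.0],
              [0.0, 1.00, 0.0],
              [0.0, 0.00, 1.0]])
side_image = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
aligned = warp_homography(side_image, H, (480, 640))
```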

Reference inputs 412 might include capture device readings obtained of a stage in the absence of objects. For example, a Lidar sensor might take readings of a stage to be able to determine distances to fixed backgrounds and the like, while an optical density capture device might measure a quiescent optical density in the absence of activity. Reference stage parameters 414 might include measurements made of the stage itself, such as its lighting independent of a capture device, while capture device positions/settings 416 might include calibration settings and positions of capture devices relative to a stage. It should be understood that the stage need not be a physical stage, but might be some other environment within which a scene to be captured can occur. For example, where a scene is to be shot of actors in battle outdoors, the stage might be an open field and the cameras and sensing devices might be placed relative to that open field to capture the visual action and capture device inputs.

Normalizing parameters are provided to a normalizer 430 that can process the arbitrary inputs 406 to generate normalized inputs, which can be stored in a normalized capture data storage 432. The normalized inputs might be such that they can be used to fill in portions of a stage in a scene that was captured with a main camera, where those portions were not captured in the main camera imagery due to being obscured by objects that are to be removed from the captured imagery. One example of normalization would be to modify inputs from another image capture device that was capturing light from the scene while the main camera was capturing the main action, but where lighting, colors, and other factors would result in the other image capture device capturing pixel color values that are not matched with what would have been captured by the main camera for the plate but for the obscuring objects.
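One simple, hypothetical form of this normalization is a per-channel gain and offset that matches the mean and standard deviation of a witness camera's colors to those of the main camera over a shared reference region; a real pipeline would use fuller color management, but the sketch below conveys the idea.

```python
import numpy as np


def match_channel_statistics(witness: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Scale and shift each color channel of `witness` so its mean and standard
    deviation match those of `reference` (e.g., pixels both cameras see of the
    same chart or set region)."""
    out = witness.astype(np.float32)
    for c in range(out.shape[-1]):
        w_mean, w_std = out[..., c].mean(), out[..., c].std() + 1e-6
        r_mean, r_std = reference[..., c].mean(), reference[..., c].std()
        out[..., c] = (out[..., c] - w_mean) * (r_std / w_std) + r_mean
    return np.clip(out, 0, 255).astype(np.uint8)


# Hypothetical 8x8 patches seen by the witness camera and the main camera.
witness_patch = np.random.randint(50, 150, (8, 8, 3), dtype=np.uint8)
main_patch = np.random.randint(100, 200, (8, 8, 3), dtype=np.uint8)
normalized = match_channel_statistics(witness_patch, main_patch)
```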

Reconstructing a plate from the main camera capture and normalized inputs from other capture devices might not be straightforward. In such cases, a machine-learning reconstructor 440 might take as inputs reconstruction parameters 442, reconstruction input selection 444, normalized capture data from storage 432, and main scene imagery from storage 404. Machine-learning reconstructor 440 might be trained on video with known values for what should be reconstructed. Once trained, machine-learning reconstructor 440 can output, from those inputs, reconstructed imagery 450. In an embodiment, reconstructed (i.e., modified) imagery 450 corresponds to the main camera video 402, but where portions of a scene that were obstructed by objects to be removed are reconstructed so as to appear as if those removed objects were not present in the scene when it was captured.

FIG. 5 is a flowchart of a process of pixel replacement for plate reconstruction (or possibly for other purposes instead or as well). For example, starting with a hero plate video sequence and inputs from other sources, the process could allow a video editing system to replace pixel color values of a set of pixels (a contiguous region of a view plane, or discontinuous regions) in the hero plate video with pixel color values of a set of alternative pixels, where the alternative pixels have color values representing (or matching) pixel color values that would have occurred in the hero plate video sequence but for being obscured by something in the live action scene. In a specific example, an actor is removed from a hero plate video sequence by replacing a set of pixels in frames of the hero plate video sequence with pixels having pixel color values captured by another camera that would have been, or are close to what would have been, the color of pixels capturing an image of a background behind the actor. The process might be used for plate reconstruction from inputs that are not necessarily tied to the details of a camera that is capturing a main view of the scene. The process might be performed by an image processing system or as part of a larger studio content creation system that might comprise a stage, props, cameras, objects on scene, computer processors, storage, and artist and other user interfaces for working with content that is captured within the studio content creation system. In the examples below, the process will be described with reference to an imagery creation system capable of capturing images and/or video and modifying the resulting captured imagery, with or without human user input.

As illustrated in FIG. 5, there are a number of inputs. A hero plate video 502 might be in the form of a stored digitized stream from a main camera during a live action scene on a stage (which could be a physical stage, such as a studio, a sound stage, or a logical stage). Other devices that captured varying views of the live action scene generate alternative imagery, such as machine vision camera video 504, witness camera video 506, clean plate video 508, tile set data 510, video geometry data 512, textures library data 514, and textured Lidar data 516. Textured Lidar data 516 might provide information about geometry for a set.

As illustrated, hero plate video 502 might be provided to an input device transformer 520 and machine vision camera video 504 might be provided to a depth estimator 522, which would in turn provide their outputs to an image segment generator 524. Image segment generator 524 might use those inputs to determine boundaries of objects, such as moving objects, in the captured main imagery, for later use. Each of hero plate video 502, machine vision camera video 504, witness camera video 506, clean plate video 508, and tile set data 510 might be provided to a camera and lens solver 530 that might transform each of those inputs to account for camera positions, camera zoom, camera pan, and lens distortion, so that there is, at least approximately, some pixel-to-pixel correlation between the main imagery and the alternative imagery. Outputs of camera and lens solver 530 can be provided to an image selector 540.
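A hedged sketch of one piece of such a solver, radial lens undistortion with a Brown-Conrady style model, is shown below; the intrinsics and distortion coefficients are placeholders standing in for values recovered during camera calibration, and only a single approximation step of the usual inversion is shown.

```python
import numpy as np


def undistort_points(points: np.ndarray, fx: float, fy: float,
                     cx: float, cy: float, k1: float, k2: float) -> np.ndarray:
    """Remove radial distortion from pixel coordinates (N x 2) using a
    two-coefficient radial model, with one step of the usual fixed-point scheme."""
    x = (points[:, 0] - cx) / fx
    y = (points[:, 1] - cy) / fy
    r2 = x * x + y * y
    scale = 1.0 + k1 * r2 + k2 * r2 * r2   # forward distortion factor at this radius
    x_u, y_u = x / scale, y / scale        # approximate inverse (one iteration)
    return np.stack([x_u * fx + cx, y_u * fy + cy], axis=1)


# Placeholder intrinsics and distortion coefficients for illustration only.
pts = np.array([[100.0, 200.0], [960.0, 540.0], [1800.0, 1000.0]])
print(undistort_points(pts, fx=1500.0, fy=1500.0, cx=960.0, cy=540.0,
                       k1=-0.12, k2=0.03))
```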

Machine vision camera video 504 might be obtained from one or more machine vision cameras. These machine vision cameras might be those used for mo-cap, might be deployed in stereo pairs, and might be able to provide information for depth recovery and geometry recovery. Machine vision camera video 504 might be color or might be monochrome.

The input device transformer 520 might provide information as to user masks, such as editor or artist indicators of which areas of frames of the hero plate video 502 are to be excluded (or, potentially, marked as definitely required, so that the system will not automatically exclude objects that are in fact desired). The input device transformer 520 might also perform operations to align color processing applied by different kinds of cameras, if that is not already handled elsewhere.

Image selector 540 selects from among possible pixel replacements for a selected portion of the main imagery video sequence. Image selector 540 might automatically select a best option, might perform a voting scheme to select a best option, might generate a weighted blend of more than one alternative, or apply another variation, and provide a replacement set of pixels to a geometry projector 550. Image segment generator 524 provides segmentation, such as borders between portions of a main image view that are to be replaced, and the geometry projector 550 can replace those pixels accordingly with corresponding pixel color values from one or more alternative sources. Image selector 540 might include a user interface to allow an artist or film editor to select among options, perhaps according to what is most suitable visually.
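As a loose illustration of the weighted-blend option, the following sketch combines several candidate replacement patches using per-candidate confidence weights; the simple normalized blend used here is an assumption for illustration, not a detail taken from the application.

```python
import numpy as np
from typing import List


def blend_candidates(candidates: List[np.ndarray], weights: List[float]) -> np.ndarray:
    """Weighted blend of candidate replacement patches (all H x W x 3)."""
    w = np.asarray(weights, dtype=np.float32)
    w = w / w.sum()                               # normalize weights to sum to 1
    stack = np.stack([c.astype(np.float32) for c in candidates], axis=0)
    blended = np.tensordot(w, stack, axes=1)      # weighted sum over candidates
    return np.clip(blended, 0, 255).astype(np.uint8)


# Two hypothetical candidates: one from a witness camera, one from a clean plate.
a = np.full((4, 4, 3), 100, dtype=np.uint8)
b = np.full((4, 4, 3), 180, dtype=np.uint8)
patch = blend_candidates([a, b], weights=[0.75, 0.25])  # result is closer to candidate a
```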

The geometry projector 550 can store its output into geometry and texture storage 560. Its output might be a reconstructed image, perhaps based on Lidar data and camera parameters for a source image, that comprises what the source image would be if seen from the main camera's frame. For example, objects that are common between the reconstructed image and the source image might appear in the same place in the frame, even though the cameras were in different positions.

The stored video might or might not be exactly what is needed. It may be that undesirable and/or noticeable artifacts remain. An editor can manually paint to reduce those artifacts, but that can be time-consuming. As illustrated, output from geometry and texture storage 560 can be provided to a lighting conformer 562, which in turn has its output processed by a projection image fuser 564, a resolution connector 566, a motion blur corrector 568, a focus corrector 570, and a temporal stability aligner 572. Resolution connector 566, motion blur corrector 568, and focus corrector 570 might obtain inputs and/or parameters from textured Lidar data 516. An output of temporal stability aligner 572 might be stored as a synthetic clean plate video sequence in storage 580. The synthetic clean plate video sequence might represent the hero plate video with objects appearing to be seamlessly removed from a scene in post-capture editing.

The projection image fuser 564 deals with having multiple images. The geometry projector 550 might be applied separately to every source image, resulting in multiple instances for a frame, perhaps dozens or hundreds of instances. The projection image fuser 564 merges the multiple instances of reconstructed frames and the source frame into a single output frame, which might be a selection of one frame or a function of more than one frame. In one approach, the projection image fuser 564 considers candidate inputs and averages them all together. In another approach, the projection image fuser 564 considers candidate inputs and selects the one having the highest rating according to some criteria. In yet another approach, the projection image fuser 564 considers candidate inputs and selects among images on a pixel-by-pixel basis according to some criteria. The image selector 540 might be used to reduce the number of images the projection image fuser 564 (and other elements in the pipeline shown) needs to deal with.
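A minimal sketch of the pixel-by-pixel variant follows, assuming each candidate reconstruction comes with a per-pixel score (for example, how directly its source camera saw that part of the set); the scoring itself is hypothetical and not specified by the application.

```python
import numpy as np


def fuse_per_pixel(candidates: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Pick, for every pixel, the candidate frame with the highest score.

    candidates: K x H x W x 3 array of reconstructed frames.
    scores:     K x H x W array of per-pixel confidence values.
    """
    best = np.argmax(scores, axis=0)           # H x W indices into the K candidates
    h_idx, w_idx = np.indices(best.shape)
    return candidates[best, h_idx, w_idx]      # H x W x 3 fused result


# Three hypothetical candidate reconstructions of a small frame.
cands = np.random.randint(0, 256, (3, 2, 2, 3), dtype=np.uint8)
confs = np.random.rand(3, 2, 2).astype(np.float32)
fused = fuse_per_pixel(cands, confs)
```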

The motion blur corrector 568 might add or remove motion blur so that motion blur matches between the image being reconstructed and the source image, e.g., they both would have motion blur or they both would not have motion blur.
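As a rough illustration of the "add motion blur" direction, the sketch below applies a horizontal box blur of a chosen length to a reconstructed patch; real matching would estimate blur direction and length from the plates, which this example simply takes as given.

```python
import numpy as np


def apply_horizontal_motion_blur(image: np.ndarray, length: int) -> np.ndarray:
    """Apply a horizontal box blur of `length` pixels, a crude stand-in for
    matching linear motion blur between a reconstruction and its source."""
    if length <= 1:
        return image
    kernel = np.ones(length, dtype=np.float32) / length
    out = image.astype(np.float32)
    for c in range(out.shape[-1]):
        # Convolve each row of each channel with the 1-D blur kernel.
        out[..., c] = np.apply_along_axis(
            lambda row: np.convolve(row, kernel, mode="same"), 1, out[..., c])
    return np.clip(out, 0, 255).astype(np.uint8)


patch = np.random.randint(0, 256, (8, 16, 3), dtype=np.uint8)
blurred = apply_horizontal_motion_blur(patch, length=5)
```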

The various embodiments further can be implemented in a wide variety of operating environments, which in some cases can include one or more user computers, computing devices, or processing devices that can be used to operate any of a number of applications. Such a system also can include a number of workstations running any of a variety of commercially available operating systems and other known applications. These devices also can include virtual devices such as virtual machines, hypervisors, and other virtual devices capable of communicating via a network.

Note that, in the context of describing disclosed embodiments, unless otherwise specified, use of expressions regarding executable instructions (also referred to as code, applications, agents, etc.) performing operations that "instructions" do not ordinarily perform unaided (e.g., transmission of data, calculations, etc.) denotes that the instructions are being executed by a machine, thereby causing the machine to perform the specified operations.

According to one embodiment, the techniques described herein are implemented by one or more generalized computing systems programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination. Special-purpose computing devices may be used, such as desktop computer systems, portable computer systems, handheld devices, networking devices, or any other device that incorporates hard-wired and/or program logic to implement the techniques.

For example, FIG. 6 illustrates the example visual content generation system 600 as might be used to generate imagery in the form of still images and/or video sequences of images. The visual content generation system 600 might generate imagery of live action scenes, computer generated scenes, or a combination thereof. In a practical system, users are provided with tools that allow them to specify, at high levels and low levels where necessary, what is to go into that imagery. For example, a user might be an animation artist and might use the visual content generation system 600 to capture interaction between two human actors performing live on a sound stage and replace one of the human actors with a computer-generated anthropomorphic non-human being that behaves in ways that mimic the replaced human actor's movements and mannerisms, and then add in a third computer-generated character and background scene elements that are computer-generated, all in order to tell a desired story or generate desired imagery.

Still images that are output by the visual content generation system 600 might be represented in computer memory as pixel arrays, such as a two-dimensional array of pixel color values, each associated with a pixel having a position in a two-dimensional image array. Pixel color values might be represented by three or more (or fewer) color values per pixel, such as a red value, a green value, and a blue value (e.g., in RGB format). Dimensions of such a two-dimensional array of pixel color values might correspond to a preferred and/or standard display scheme, such as 1920-pixel columns by 1280-pixel rows. Images might or might not be stored in a compressed format, but either way, a desired image may be represented as a two-dimensional array of pixel color values. In another variation, images are represented by a pair of stereo images for three-dimensional presentations and in other variations, some of the image output, or all of it, might represent three-dimensional imagery instead of just two-dimensional views.

A stored video sequence might include a plurality of images such as the still images described above, but where each image of the plurality of images has a place in a timing sequence and the stored video sequence is arranged so that when each image is displayed in order, at a time indicated by the timing sequence, the display presents what appears to be moving and/or changing imagery. In one representation, each image of the plurality of images is a video frame having a specified frame number that corresponds to an amount of time that would elapse from when a video sequence begins playing until that specified frame is displayed. A frame rate might be used to describe how many frames of the stored video sequence are displayed per unit time. Example video sequences might include 24 frames per second (24 FPS), 50 FPS, 140 FPS, or other frame rates. In some embodiments, frames are interlaced or otherwise presented for display, but for clarity of description, in some examples, it is assumed that a video frame has one specified display time, but other variations might be contemplated.
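Under a constant frame rate, the relationship between a frame number and its display time is a simple division, as in the small sketch below (frame numbering from zero is an assumption made for the example).

```python
def frame_display_time(frame_number: int, frames_per_second: float) -> float:
    """Seconds elapsed from the start of playback until this frame is shown,
    assuming frame 0 displays at time 0 and a constant frame rate."""
    return frame_number / frames_per_second


print(frame_display_time(48, 24.0))  # 2.0 seconds into a 24 FPS sequence
```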

One method of creating a video sequence is to simply use a video camera to record a live action scene, i.e., events that physically occur and can be recorded by a video camera. The events being recorded can be events to be interpreted as viewed (such as seeing two human actors talk to each other) and/or can include events to be interpreted differently due to clever camera operations (such as moving actors about a stage to make one appear larger than the other despite the actors actually being of similar build, or using miniature objects with other miniature objects so as to be interpreted as a scene containing life-sized objects).

Creating video sequences for story-telling or other purposes often calls for scenes that cannot be created with live actors, such as a talking tree, an anthropomorphic object, space battles, and the like. Such video sequences might be generated computationally rather than capturing light from live scenes. In some instances, an entirety of a video sequence might be generated computationally, as in the case of a computer-animated feature film. In some video sequences, it is desirable to have some computer-generated imagery and some live action, perhaps with some careful merging of the two.

While computer-generated imagery might be creatable by manually specifying each color value for each pixel in each frame, this is likely too tedious to be practical. As a result, a creator uses various tools to specify the imagery at a higher level. As an example, an artist might specify the positions in a scene space, such as a three-dimensional coordinate system, of objects and/or lighting, as well as a camera viewpoint and a camera view plane. From that, a rendering engine could take all of those as inputs and compute each of the pixel color values in each of the frames. In another example, an artist specifies position and movement of an articulated object having some specified texture rather than specifying the color of each pixel representing that articulated object in each frame.

In a specific example, a rendering engine performs ray tracing wherein a pixel color value is determined by computing which objects lie along a ray traced in the scene space from the camera viewpoint through a point or portion of the camera view plane that corresponds to that pixel. For example, a camera view plane might be represented as a rectangle having a position in the scene space that is divided into a grid corresponding to the pixels of the ultimate image to be generated, and if a ray defined by the camera viewpoint in the scene space and a given pixel in that grid first intersects a solid, opaque, blue object, that given pixel is assigned the color blue. Of course, for modern computer-generated imagery, determining pixel colors, and thereby generating imagery, can be more complicated, as there are lighting issues, reflections, interpolations, and other considerations.
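A toy version of that first-hit rule, with spheres as the only object type, can be sketched as follows; the scene contents and camera parameters are invented purely for illustration.

```python
import numpy as np


def first_hit_color(origin, direction, spheres, background=(0, 0, 0)):
    """Return the color of the nearest sphere intersected by the ray, or the
    background color if nothing is hit. Each sphere is (center, radius, color)."""
    direction = direction / np.linalg.norm(direction)
    best_t, best_color = np.inf, background
    for center, radius, color in spheres:
        oc = origin - center
        b = 2.0 * np.dot(oc, direction)
        c = np.dot(oc, oc) - radius * radius
        disc = b * b - 4.0 * c
        if disc < 0:
            continue                      # ray misses this sphere
        t = (-b - np.sqrt(disc)) / 2.0    # nearer intersection distance
        if 0 < t < best_t:
            best_t, best_color = t, color
    return best_color


# One opaque blue sphere straight ahead of the camera viewpoint.
scene = [(np.array([0.0, 0.0, 5.0]), 1.0, (0, 0, 255))]
print(first_hit_color(np.zeros(3), np.array([0.0, 0.0, 1.0]), scene))  # (0, 0, 255)
```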

As illustrated in FIG. 6, a live action capture system 602 captures a live scene that plays out on a stage 604. The live action capture system 602 is described herein in greater detail, but might include computer processing capabilities, image processing capabilities, one or more processors, program code storage for storing program instructions executable by the one or more processors, as well as user input devices and user output devices, not all of which are shown.

In a specific live action capture system, cameras 606(1) and 606(2) capture the scene, while in some systems, there might be other sensor(s) 608 that capture information from the live scene (e.g., infrared cameras, infrared sensors, motion capture ("mo-cap") detectors, etc.). On the stage 604, there might be human actors, animal actors, inanimate objects, background objects, and possibly an object such as a green screen 610 that is designed to be captured in a live scene recording in such a way that it is easily overlaid with computer-generated imagery. The stage 604 might also contain objects that serve as fiducials, such as fiducials 612(1)-(3), that might be used post-capture to determine where an object was during capture. A live action scene might be illuminated by one or more lights, such as an overhead light 614.

During or following the capture of a live action scene, the live action capture system 602 might output live action footage to a live action footage storage 620. A live action processing system 622 might process live action footage to generate data about that live action footage and store that data into a live action metadata storage 624. The live action processing system 622 might include computer processing capabilities, image processing capabilities, one or more processors, program code storage for storing program instructions executable by the one or more processors, as well as user input devices and user output devices, not all of which are shown. The live action processing system 622 might process live action footage to determine boundaries of objects in a frame or multiple frames, determine locations of objects in a live action scene, where a camera was relative to some action, distances between moving objects and fiducials, etc. Where elements have sensors attached to them or are detected, the metadata might include location, color, and intensity of the overhead light 614, as that might be useful in post-processing to match computer-generated lighting on objects that are computer-generated and overlaid on the live action footage. The live action processing system 622 might operate autonomously, perhaps based on predetermined program instructions, to generate and output the live action metadata upon receiving and inputting the live action footage. The live action footage can be camera-captured data as well as data from other sensors.

An animation creation system 630 is another part of the visual content generation system 600. The animation creation system 630 might include computer processing capabilities, image processing capabilities, one or more processors, program code storage for storing program instructions executable by the one or more processors, as well as user input devices and user output devices, not all of which are shown. The animation creation system 630 might be used by animation artists, managers, and others to specify details, perhaps programmatically and/or interactively, of imagery to be generated. From user input and data from a database or other data source, indicated as a data store 632, the animation creation system 630 might generate and output data representing objects (e.g., a horse, a human, a ball, a teapot, a cloud, a light source, a texture, etc.) to an object storage 634, generate and output data representing a scene into a scene description storage 636, and/or generate and output data representing animation sequences to an animation sequence storage 638.

Scene data might indicate locations of objects and other visual elements, values of their parameters, lighting, camera location, camera view plane, and other details that a rendering engine 650 might use to render CGI imagery. For example, scene data might include the locations of several articulated characters, background objects, lighting, etc. specified in a two-dimensional space, three-dimensional space, or other dimensional space (such as a 2.5-dimensional space, three-quarter dimensions, pseudo-3D spaces, etc.) along with locations of a camera viewpoint and view plane from which to render imagery. For example, scene data might indicate that there is to be a red, fuzzy, talking dog in the right half of a video and a stationary tree in the left half of the video, all illuminated by a bright point light source that is above and behind the camera viewpoint. In some cases, the camera viewpoint is not explicit, but can be determined from a viewing frustum. In the case of imagery that is to be rendered to a rectangular view, the frustum would be a truncated pyramid. Other shapes for a rendered view are possible and the camera view plane could be different for different shapes.

The animation creation system 630 might be interactive, allowing a user to read in animation sequences, scene descriptions, object details, etc. and edit those, possibly returning them to storage to update or replace existing data. As an example, an operator might read in objects from object storage into a baking processor that would transform those objects into simpler forms and return those to the object storage 634 as new or different objects. For example, an operator might read in an object that has dozens of specified parameters (movable joints, color options, textures, etc.), select some values for those parameters and then save a baked object that is a simplified object with now fixed values for those parameters.

Rather than requiring user specification of each detail of a scene, data from the data store 632 might be used to drive object presentation. For example, if an artist is creating an animation of a spaceship passing over the surface of the Earth, instead of manually drawing or specifying a coastline, the artist might specify that the animation creation system 630 is to read data from the data store 632 in a file containing coordinates of Earth coastlines and generate background elements of a scene using that coastline data.

Animation sequence data might be in the form of time series of data for control points of an object that has attributes that are controllable. For example, an object might be a humanoid character with limbs and joints that are movable in manners similar to typical human movements. An artist can specify an animation sequence at a high level, such as "the left hand moves from location (X1, Y1, Z1) to (X2, Y2, Z2) over time T1 to T2", at a lower level (e.g., "move the elbow joint 2.5 degrees per frame") or even at a very high level (e.g., "character A should move, consistent with the laws of physics that are given for this scene, from point P1 to point P2 along a specified path").

Animation sequences in an animated scene might be specified by what happens in a live action scene. An animation driver generator 644 might read in live action metadata, such as data representing movements and positions of body parts of a live actor during a live action scene. The animation driver generator 644 might generate corresponding animation parameters to be stored in the animation sequence storage 638 for use in animating a CGI object. This can be useful where a live action scene of a human actor is captured while wearing mo-cap fiducials (e.g., high-contrast markers outside actor clothing, high-visibility paint on actor skin, face, etc.) and the movement of those fiducials is determined by the live action processing system 622. The animation driver generator 644 might convert that movement data into specifications of how joints of an articulated CGI character are to move over time.

A rendering engine 650 can read in animation sequences, scene descriptions, and object details, as well as rendering engine control inputs, such as a resolution selection and a set of rendering parameters. Resolution selection might be useful for an operator to control a trade-off between speed of rendering and clarity of detail, as speed might be more important than clarity for a movie maker to test some interaction or direction, while clarity might be more important than speed for a movie maker to generate data that will be used for final prints of feature films to be distributed. The rendering engine 650 might include computer processing capabilities, image processing capabilities, one or more processors, program code storage for storing program instructions executable by the one or more processors, as well as user input devices and user output devices, not all of which are shown.

The visual content generation system 600 can also include a merging system 660 that merges live footage with animated content. The live footage might be obtained and input by reading from the live action footage storage 620 to obtain live action footage, by reading from the live action metadata storage 624 to obtain details such as presumed segmentation in captured images segmenting objects in a live action scene from their background (perhaps aided by the fact that the green screen 610 was part of the live action scene), and by obtaining CGI imagery from the rendering engine 650.

A merging system 660 might also read data from rulesets for merging/combining storage 662. A very simple example of a rule in a ruleset might be "obtain a full image including a two-dimensional pixel array from live footage, obtain a full image including a two-dimensional pixel array from the rendering engine 650, and output an image where each pixel is a corresponding pixel from the rendering engine 650 when the corresponding pixel in the live footage is a specific color of green, otherwise output a pixel value from the corresponding pixel in the live footage."
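The quoted rule could be written, very roughly, as the following sketch; the particular green test and threshold values are made-up placeholders, and production keyers are considerably more sophisticated.

```python
import numpy as np


def merge_on_green(live: np.ndarray, cgi: np.ndarray) -> np.ndarray:
    """Where the live-action pixel is 'screen green', take the rendered CGI
    pixel; otherwise keep the live-action pixel. Both inputs are H x W x 3."""
    r = live[..., 0].astype(int)
    g = live[..., 1].astype(int)
    b = live[..., 2].astype(int)
    is_green = (g > 150) & (g > r + 40) & (g > b + 40)   # crude chroma test
    out = live.copy()
    out[is_green] = cgi[is_green]
    return out


live_frame = np.zeros((2, 2, 3), dtype=np.uint8)
live_frame[0, 0] = (30, 200, 40)                         # a green-screen pixel
cgi_frame = np.full((2, 2, 3), 255, dtype=np.uint8)
merged = merge_on_green(live_frame, cgi_frame)           # pixel (0, 0) now comes from CGI
```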

As used herein, the term "pixel value" can include multiple values or other data associated with rendering a pixel. The data can be organized as one or more "channels" such as separate channels for color values, alpha values, pan-chromatic pixel definition, shadow map values, etc. The pixel data can be organized into different formats and may be represented by any suitable data structure, such as one or more vectors, arrays, lists, strings, etc.

The merging system 660 might include computer processing capabilities, image processing capabilities, one or more processors, program code storage for storing program instructions executable by the one or more processors, as well as user input devices and user output devices, not all of which are shown. The merging system 660 might operate autonomously, following programming instructions, or might have a user interface or programmatic interface over which an operator can control a merging process. In some embodiments, an operator can specify parameter values to use in a merging process and/or might specify specific tweaks to be made to an output of the merging system 660, such as modifying boundaries of segmented objects, inserting blurs to smooth out imperfections, or adding other effects. Based on its inputs, the merging system 660 can output an image to be stored in a static image storage 670 and/or a sequence of images in the form of video to be stored in an animated/combined video storage 672.

Thus, as described, the visual content generation system 600 can be used to generate video that combines live action with computer-generated animation using various components and tools, some of which are described in more detail herein. While the visual content generation system 600 might be useful for such combinations, with suitable settings, it can be used for outputting entirely live action footage or entirely CGI sequences. The code may also be provided and/or carried by a transitory computer readable medium, e.g., a transmission medium such as in the form of a signal transmitted over a network.

According to one embodiment, the techniques described herein are implemented by one or more generalized computing systems programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination. Special-purpose computing devices may be used, such as desktop computer systems, portable computer systems, handheld devices, networking devices, or any other device that incorporates hard-wired and/or program logic to implement the techniques.

For example, FIG. 7 is a block diagram that illustrates a computer system 700 upon which the computer systems of the systems described herein and/or the visual content generation system 600 (see FIG. 6) may be implemented. The computer system 700 includes a bus 702 or other communication mechanism for communicating information, and a processor 704 coupled with the bus 702 for processing information. The processor 704 may be, for example, a general-purpose microprocessor.

The computer system 700 also includes a main memory 706, such as a random-access memory (RAM) or other dynamic storage device, coupled to the bus 702 for storing information and instructions to be executed by the processor 704. The main memory 706 may also be used for storing temporary variables or other intermediate information during execution of instructions to be executed by the processor 704. Such instructions, when stored in non-transitory storage media accessible to the processor 704, render the computer system 700 into a special-purpose machine that is customized to perform the operations specified in the instructions.

The computer system 700 further includes a read only memory (ROM) 708 or other static storage device coupled to the bus 702 for storing static information and instructions for the processor 704. A storage device 710, such as a magnetic disk or optical disk, is provided and coupled to the bus 702 for storing information and instructions.

The computer system 700 may be coupled via the bus 702 to a display 712, such as a computer monitor, for displaying information to a computer user. An input device 714, including alphanumeric and other keys, is coupled to the bus 702 for communicating information and command selections to the processor 704. Another type of user input device is a cursor control 716, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to the processor 704 and for controlling cursor movement on the display 712. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.

The computer system 700 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs the computer system 700 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by the computer system 700 in response to the processor 704 executing one or more sequences of one or more instructions contained in the main memory 706. Such instructions may be read into the main memory 706 from another storage medium, such as the storage device 710. Execution of the sequences of instructions contained in the main memory 706 causes the processor 704 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.

The term "storage media" as used herein refers to any non-transitory media that store data and/or instructions that cause a machine to operate in a specific fashion. Such storage media may include non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as the storage device 710. Volatile media includes dynamic memory, such as the main memory 706. Common forms of storage media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, or any other memory chip or cartridge.

Storage media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between storage media. For example, transmission media includes coaxial cables, copper wire, and fiber optics, including the wires that include the bus 702. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.

Various forms of media may be involved in carrying one or more sequences of one or more instructions to the processor 704 for execution. For example, the instructions may initially be carried on a magnetic disk or solid-state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a network connection. A modem or network interface local to the computer system 700 can receive the data. The bus 702 carries the data to the main memory 706, from which the processor 704 retrieves and executes the instructions. The instructions received by the main memory 706 may optionally be stored on the storage device 710 either before or after execution by the processor 704.

The computer system 700 also includes a communication interface 718 coupled to the bus 702. The communication interface 718 provides a two-way data communication coupling to a network link 720 that is connected to a local network 722. For example, the communication interface 718 may be a network card, a modem, a cable modem, or a satellite modem to provide a data communication connection to a corresponding type of telephone line or communications line. Wireless links may also be implemented. In any such implementation, the communication interface 718 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.

The network link 720 typically provides data communication through one or more networks to other data devices. For example, the network link 720 may provide a connection through the local network 722 to a host computer 724 or to data equipment operated by an Internet Service Provider (ISP) 726. The ISP 726 in turn provides data communication services through the world-wide packet data communication network now commonly referred to as the "Internet" 728. The local network 722 and Internet 728 both use electrical, electromagnetic, or optical signals that carry digital data streams. The signals through the various networks and the signals on the network link 720 and through the communication interface 718, which carry the digital data to and from the computer system 700, are example forms of transmission media.

The computer system 700 can send messages and receive data, including program code, through the network(s), the network link 720, and the communication interface 718. In the Internet example, a server 730 might transmit a requested code for an application program through the Internet 728, ISP 726, local network 722, and communication interface 718. The received code may be executed by the processor 704 as it is received, and/or stored in the storage device 710, or other non-volatile storage for later execution.

Operations of processes described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. Processes described herein (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions and may be implemented as code (e.g., executable instructions, one or more computer programs or one or more applications) executing collectively on one or more processors, by hardware or combinations thereof. The code may be stored on a computer-readable storage medium, for example, in the form of a computer program comprising a plurality of instructions executable by one or more processors. The computer-readable storage medium may be non-transitory. The code may also be provided and/or carried by a transitory computer readable medium, e.g., a transmission medium such as in the form of a signal transmitted over a network.

Conjunctive language, such as phrases of the form "at least one of A, B, and C," or "at least one of A, B and C," unless specifically stated otherwise or otherwise clearly contradicted by context, is otherwise understood with the context as used in general to present that an item, term, etc., may be either A or B or C, or any nonempty subset of the set of A and B and C. For instance, in the illustrative example of a set having three members, the conjunctive phrases "at least one of A, B, and C" and "at least one of A, B and C" refer to any of the following sets: {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}. Thus, such conjunctive language is not generally intended to imply that certain embodiments require at least one of A, at least one of B, and at least one of C each to be present.

The use of examples, or exemplary language (e.g., "such as") provided herein, is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention.

In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the invention, and what is intended by the applicants to be the scope of the invention, is the literal and equivalent scope of the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction.

Further embodiments can be envisioned by one of ordinary skill in the art after reading this disclosure. In other embodiments, combinations or sub-combinations of the above-disclosed invention can be advantageously made. The example arrangements of components are shown for purposes of illustration, and combinations, additions, re-arrangements, and the like are contemplated in alternative embodiments of the present invention. Thus, while the invention has been described with respect to exemplary embodiments, one skilled in the art will recognize that numerous modifications are possible.

For example, the processes described herein may be implemented using hardware components, software components, and/or any combination thereof. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention as set forth in the claims and that the invention is intended to cover all modifications and equivalents within the scope of the following claims.

All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and were set forth in its entirety herein.

What is claimed is:
1. A computer-implemented method of processing images from a main imaging device using capture device inputs from capture devices, the method comprising: obtaining a main video sequence using a main camera, wherein the main video sequence comprises a main pixel array derived from a view of a live action scene; obtaining additional video data of the live action scene from ancillary devices, wherein the ancillary devices are positioned such that at least one ancillary device captures visual information from at least one obscured object, wherein the obscured object is obscured, at least in part, from view of the main camera by a primary object; determining a three-dimensional position of an ancillary device providing at least a portion of the additional data; mapping the additional data into corresponding volumetric data according to main camera rays from the main camera to objects in the main image; providing the main video sequence including the main pixel array and the corresponding volumetric data so that when pixels describing the primary object are removed from the main pixel array they can be replaced by pixels derived from the volumetric data; accepting a signal to remove at least a portion of the primary object from the main pixel array; selecting one or more corresponding alternative pixel values from the volumetric data; and generating a synthetic image comprising a synthetic pixel array wherein pixel color values of the synthetic pixel array are pixel color values of pixels from the main pixel array for portions of the primary object not replaced and are pixel color values of pixels from the volumetric data for the replacement portions to result in removing the at least a portion of the primary object from the main pixel array while replacing it with at least a portion of the obscured object that would have been captured by the main camera had the at least a portion of the primary object not been in the live action scene.
2. The method of claim 1, wherein the ancillary devices include devices placed on a set at differing angles from the main camera and wherein indexing the additional data into volumetric data comprises performing geometric transformations to account for differing angles.
3. The method of claim 1, wherein the synthetic image comprises a plurality of discontinuous replacement regions.
4. The method of claim 1, wherein the signal is generated by a user input device.
5. The method of claim 1, wherein the signal is generated by a digital process.
6. The method of claim 1, further comprising: obtaining user input for determining which objects in the main image to replace using the volumetric data.
7. The method of claim 1, wherein a pixel value includes two or more channels.
8. The method of claim 7, wherein a channel includes one or more of: color value, alpha value, pan-chromatic definition, shadow map value.
9. The method of claim 1, where the replacement regions include a two-dimensional region.
10. The method of claim 1, where the replacement regions include a three-dimensional region.
11. The method of claim 1, where the main camera includes a digital camera.
12. The method of claim 1, wherein two or more main cameras are used.
13. The method of claim 12, wherein two stereo main cameras are used.
14. The method of claim 1, where an ancillary device includes one or more of an optical camera, infrared camera, Lidar sensor, sonar sensor, dot flood system.
15. The method of claim 1, where the obscured object is identified at least in part by tracing a ray from a particular capture device along a view line toward an object in the main image.
16. An apparatus using a digital processor to perform the acts of claim 1.
17. One or more processor-readable non-transitory media including instructions executable by one or more processors to perform the acts of claim 1.